Goto

Collaborating Authors

 listwise deletion


Generative AI in Sociological Research: State of the Discipline

arXiv.org Artificial Intelligence

Generative artificial intelligence (GenAI) has garnered considerable attention for its potential utility in research and scholarship. A growing body of work in sociology and related fields demonstrates both the potential advantages and risks of GenAI, but these studies are largely proof-of-concept or specific audits of models and products. We know comparatively little about how sociologists actually use GenAI in their research practices and how they view its present and future role in the discipline. In this paper, we describe the current landscape of GenAI use in sociological research based on a survey of authors in 50 sociology journals. Our sample includes both computational sociologists and non-computational sociologists and their collaborators. We find that sociologists primarily use GenAI to assist with writing tasks: revising, summarizing, editing, and translating their own work. Respondents report that GenAI saves time and that they are curious about its capabilities, but they do not currently feel strong institutional or field-level pressure to adopt it. Overall, respondents are wary of GenAI's social and environmental impacts and express low levels of trust in its outputs, but many believe that GenAI tools will improve over the next several years. We do not find large differences between computational and non-computational scholars in terms of GenAI use, attitudes, and concern; nor do we find strong patterns by familiarity or frequency of use. We discuss what these findings suggest about the future of GenAI in sociology and highlight challenges for developing shared norms around its use in research practice.


Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study

arXiv.org Machine Learning

Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four \texttt{missRanger} options as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.


Graphical Models for Inference with Missing Data

Neural Information Processing Systems

We address the problem of recoverability i.e. deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. We employ a formal representation called'Missingness Graphs' to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we derive conditions that the graph should satisfy to ensure recoverability and devise algorithms to detect the presence of these conditions in the graph.


Regression with Missing Data, a Comparison Study of TechniquesBased on Random Forests

arXiv.org Machine Learning

Random forests and recursive trees are widely used in applied statistics and computer science. The popularity of recursive trees relies on several factors: their easy interpretability, the fact that they can be used for both regression and classification tasks, the small number of hyper-parameters to be tuned and finally, their non-parametric nature that allows their use to infer arbitrarily complex relations between the input and the output space. A random forest combines several randomized trees, improving the prediction accuracy at a cost of a slight lost in interpretation. This technique is easily parallelizable which has made it one of the most popular tools for handling high dimensional data sets. It has been successfully involved in various practical problems, including chemioinformatics, ecology, 3D object recognition, bioinformatics and econometrics. Biau and Scornet (2016) present a detailed list of applications as well as a review on random forests. In the present work we have focused on the ability of random forests to deal with missing values.


Missing Data Handling

#artificialintelligence

Real-world data is messy and usually holds a lot of missing values. Missing data can skew anything for data scientists and, A data scientist doesn't want to design biased estimates that point to invalid results. Behind, any analysis is only as great as the data. Missing data appear when no value is available in one or more variables of an individual. Due to Missing data, the statistical power of the analysis can reduce, which can impact the validity of the results.


Graphical Models for Inference with Missing Data

Neural Information Processing Systems

We address the problem of deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. We employ a formal representation called `Missingness Graphs' to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we define the notion of \textit{recoverability} which ensures that, for a given missingness-graph $G$ and a given query $Q$ an algorithm exists such that in the limit of large samples, it produces an estimate of $Q$ \textit{as if} no data were missing. We further present conditions that the graph should satisfy in order for recoverability to hold and devise algorithms to detect the presence of these conditions.